Creating real R packages

The examples here are meant to be instructional, and the goal is to be clear and simple, not to obtain the best speed or memory usage except when it is clearly stated. For example, the OLS example uses a naive approach to solve the problem, which is not the way R’s lm() function does it.

Please see each package’s README for more information.

Creating an R package with C++ code

Originally available in my blog.

Motivation

Using a compiled language such as C++ is justified when it is challenging to vectorize R code, or when the code is computationally intensive and we need repeated calls of the same function that justify the time invested in re-writing R code code in C++.

Setup

Ubuntu and its derived distributions use gcc as the default C++ compiler. I will use g++ just for consistency with what is used at the University of Toronto in the ECE244 course.

According to Ubuntu documentation: “When you compile C++ programs, you should invoke GCC as g++ instead.”

I installed the R packages cpp11 and usethis:

install.packages(c("cpp11", "usethis"))

I created a file ~/.Rprofile containing the following lines:

library(devtools)
library(usethis)
library(cpp11)

Now forget about devtools::install(). After reopening your editor, every time you use RStudio (or VSCode) you just call install(), and the same applies to usethis::use_*() and cpp11::cpp_*() functions.

Up to this point I still had the following error messages when compiling C++ code:

fatal error: 'cstdio' file not found
fatal error: 'vector' file not found
cannot find -lc++abi: No such file or directory

I had to install additional packages. This took me a few hours searching on the Internet until I figured it out. Install the following packages:

sudo apt install g++-11 libc++-11-dev libc++abi-11-dev

To be sure that the install() function in R uses g++, I created the ~/.R/Makevars file. The contents of the file are the following:

CC = gcc
CXX = g++
CXX98 = g++
CXX11 = g++
CXX14 = g++
CXX14 = g++
CXX17 = g++
CXX20 = g++
CXXCPP = g++
OBJC = gcc
OBJCXX = g++
SHLIB_CXXLD = g++

# USE -O0 for debugging
# USE -O3 for production code
CXXFLAGS=-Wall -O3 -pedantic
CXX11FLAGS=-Wall -O3 -pedantic
CXX14FLAGS=-Wall -O3 -pedantic
CXX17FLAGS=-Wall -O3 -pedantic
CXX20FLAGS=-Wall -O3 -pedantic

A more flexible approach is to edit src/Makevars within the package folder, and add the following lines:

PKG_CXXFLAGS = -Wall -O0 -pedantic

In the CXXFLAGS I use -O0 to avoid optimization, which is useful for debugging. After the code is working, I can change it to -O3 to optimize the compiled code.

If later on I need to compile with gcc, I can open ~/.R/Makevars, comment all the lines, restart RStudio or VSCode, and run install() again.

If you close RStudio (or VSCode) and open it again, you can check that the changes were implemented by running this code:

cpp11::cpp_source(
  code = "
    #include <cpp11.hpp>

    using namespace cpp11;

    [[cpp11::register]] int plusone(int x)
    {
        return x + 1;
    }",
  quiet = FALSE
)
using C++ compiler: ‘g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0’
using C++11
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I'/home/pacha/R/x86_64-pc-linux-gnu-library/4.4/cpp11/include'      -fpic  -g -O2 -ffile-prefix-map=/build/r-base-2y82rL/r-base-4.4.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2   -c /tmp/RtmpWztjQ8/file2278e58f5ee32/src/code_2278e7f036a0f.cpp -o /tmp/RtmpWztjQ8/file2278e58f5ee32/src/code_2278e7f036a0f.o
g++ -std=gnu++11 -I"/usr/share/R/include" -DNDEBUG -I'/home/pacha/R/x86_64-pc-linux-gnu-library/4.4/cpp11/include'      -fpic  -g -O2 -ffile-prefix-map=/build/r-base-2y82rL/r-base-4.4.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2   -c /tmp/RtmpWztjQ8/file2278e58f5ee32/src/cpp11.cpp -o /tmp/RtmpWztjQ8/file2278e58f5ee32/src/cpp11.o
g++ -std=gnu++11 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro -o /tmp/RtmpWztjQ8/file2278e58f5ee32/src/code_2278e7f036a0f.so /tmp/RtmpWztjQ8/file2278e58f5ee32/src/code_2278e7f036a0f.o /tmp/RtmpWztjQ8/file2278e58f5ee32/src/cpp11.o -L/usr/lib/R/lib -lR

The output should start with:

using C++ compiler: ‘g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0’

Instead of:

using C compiler: ‘gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0’

Creating a dummy package

From RStudio (or VSCode) we can create a new package by running create_package("~/cpp11dummypackage"). This will create a new folder with the name cpp11dummypackage. Then I run use_cpp11() to add the required files to use C++ code within R.

Then I run use_r("cpp11dummypackage-package") to create a new R script file with the name cpp11dummypackage-package.R within the R folder, and added the following code to it:

#' @useDynLib cpp11dummypackage, .registration = TRUE
NULL

The usethis skeleton also created the file src/code.cpp for us. I added a simple function to transpose a matrix to it, by replacing the file contents by the following lines:

#include <cpp11.hpp>

using namespace cpp11;
    
[[cpp11::register]] doubles_matrix<> transpose_(doubles_matrix<> X)
{
    int NX = X.nrow();
    int MX = X.ncol();

    writable::doubles_matrix<> R(MX, NX);

    for (int i = 0; i < MX; i++)
    {
        for (int j = 0; j < NX; j++)
        {
            R(i, j) = X(j, i);
        }
    }

    return R;
}

In order to export the function, I added the following lines to cpp11dummypackage-package.R:

#' Transpose a matrix
#' @export
#' @param X numeric matrix
#' @return numeric matrix
#' @examples
#' set.seed(1234)
#' X <- matrix(rnorm(4), nrow = 2, ncol = 2)
#' X
#' transpose(X)
transpose <- function(X) {
  transpose_(X)
}

I tested the functions after running cpp11_register() and load_all():

> set.seed(1234)

> X <- matrix(rnorm(4), nrow = 2, ncol = 2)

> X
           [,1]      [,2]
[1,] -1.2070657  1.084441
[2,]  0.2774292 -2.345698

> transpose(X)
          [,1]       [,2]
[1,] -1.207066  0.2774292
[2,]  1.084441 -2.3456977

If I would have passed 1:4 instead of rnorm(4) to matrix(), I would have obtained the following error message:

> transpose(X)
Error: Invalid input type, expected 'double' actual 'integer'

This is because I declared the function to accept a doubles_matrix<> as input, and not an integers_matrix<>.

To install the recently created package, I run the following lines in the R console:

clean_dll()
cpp_register()
document()
install()

Then I can use the package in a new R session or another computer as follows:

# install_github("pachadotdev/cpp11dummypackage")
library(cpp11dummypackage)

X <- matrix(rnorm(4), nrow = 2, ncol = 2)

transpose(X)

Debugging the package

In order to access debugging symbols, I created a new Makevars file within the src folder, and added the following lines:

CXX_STD = CXX11
PKG_CPPFLAGS = -UDEBUG -g

Then I reinstalled the package compiled with debugging symbols, and in bash I run R -d lldb-11. From there I could follow this excellent guide to debug R and C++ code.

You shouldn’t generally leave the -g flag on in a Makevars file, that will insert trace symbols in the compiled binary, both increasing compilation times (often by a large margin), and creating larger binaries. Once the package is compiled and I am sure that it works properly, I need to remove the PKG_CPPFLAGS = -UDEBUG -g line.

Instructional examples

Solving a matrix using the Gauss-Jordan method

See the cpp11gaussjordan package.

Topics covered:

Details

This implementation is a naive approach, but it can be used, for example, to obtain the Ordinary Least Squares (OLS) estimator as shown in the next section.

Vendoring means that I copied the code for the dependencies into the project’s source tree (e.g., I copied the C++ headers provided by the cpp11 package into my own R package). This ensures the dependency code is fixed and stable until it is updated.

The advantage of vendoring is that changes to the cpp11 package could never break your existing code. The disadvantage is that you no longer get bugfixes and new features until you manually run cpp_vendor() in your project.

Building and testing

I used devtools to build and test the package:

# build

devtools::clean_dll()
cpp11::cpp_register()
devtools::document()
devtools::load_all()

# test

A <- matrix(c(2,1,3,-1), nrow = 2, ncol = 2)
invert_matrix(A)

> invert_matrix(A)
     [,1] [,2]
[1,]  0.2  0.6
[2,]  0.2 -0.4

Creating an R package with C++ code and vendoring

I started with create_package("~/github/cpp11gaussjordan"). I used VSCode but all my steps also apply to RStudio.

After opening ~/github/cpp11gaussjordan I ran use_cpp11() to have a readily availably skeleton in my project.

I ran use_apache_licence() to have a LICENSE file and indicate in DESCRIPTION that my package is distributed under the Apache License.

Then I ran cpp_vendor() to copy the C++ headers into inst/include.

Naive Ordinary Least Squares (OLS) estimator

See the cpp11ols package.

Topics covered:

Details

This implementation is extremely naive, but it is enough to show how to use C++ code within R. Please see it from my GitHub profile.

My approach was to create one function per step, which meant to create one function to obtain \(X^tX\), another for \((X^tX)^{-1}\) which consisted in implementing the Gauss-Jordan method to invert a matrix, another for \(X^tY\) and then call each of those functions to obtain \(\hat{\beta} = (X^tX)^{-1}(X^tY)\).

A good challenge would be to implement the QR decomposition used by the lm() function in R and use it to obtain the OLS estimator in C++. This would require some effort, but here you can find a good starting point.

In any case, it would be extremely hard to beat the performance of the lm() function in R, which has some internals written in C, and how computationally robust lm() is means another feature that is hard to beat.

Linear programming (Simplex phase 2)

See the cpp11simplex package.

Topics covered:

Algorithm

The simplex algorithm is well described in Introduction to Linear Optimization and there is efficient software to solve this.

A problem written in canonical form is represented by a table such as:

\[ \begin{array}{ccc|c} x_1 & \cdots & x_n & \\ \hline c_1 & \cdots & c_n & -z \\ a_{11} & \cdots & a_{1n} & b_1 \\ \vdots & \ddots & \vdots & \vdots \\ a_{m1} & \cdots & a_{mn} & b_m \end{array} \]

where \(c_1, \ldots, c_n\) are the coefficients of the objective function (i.e., costs), \(a_{11}, \ldots, a_{mn}\) are the coefficients of the constraints, and \(b_1, \ldots, b_m\) are the right-hand side of the constraints.

The simplex algorithm to solve the problem consists in the next steps:

  1. If \(c_j \geq 0\) for all \(j\), then the current solution is optimal. Basic variables are equal to \(b_i\) and non-basic variables are equal to 0.
  2. If \(c_j < 0\) for some \(j\), we choose it to enter the base. We chose the variable with the most negative \(c_j\), let’s say that it is \(j = s\).
  3. If \(a_{is} \leq 0\) for all \(i\), then the problem is unbounded.
  4. If \(a_{is} > 0\) for some \(i\), we choose \(i = r\) such that \(\frac{b_r}{a_{rs}} = \min(\frac{b_i}{a_is},\: a_{is} 0)\) and pivot on \(a_{rs}\), to then go back to step 1.

The coefficients are updated according to:

  1. \(a_{ij} \leftarrow a_{ij} - \frac{a_{is} a_{rj}}{a_{rs}}\) for \(j \neq s\)
  2. \(a_{rj} \leftarrow \frac{a_{rj}}{a_{rs}}\)
  3. \(b_i \leftarrow b_i - \frac{a_{is} b_r}{a_{rs}}\) for \(i \neq r\)
  4. \(b_r \leftarrow \frac{b_r}{a_{rs}}\)
  5. \(c_j \leftarrow c_j - \frac{c_s a_{rj}}{a_{rs}}\)
  6. \(-z \leftarrow -z - \frac{c_s b_r}{a_{rs}}\)

This algorithm is equivalent to Gauss method to solve linear systems.

Numerical example

Let’s say we want to solve the following linear programming problem:

\[ \begin{aligned} \text{min} \quad & -x_1 - 3x_2 \\ \text{subject to} \quad & x_1 + x_2 \geq 3 \\ & -3x_1 + x_2 \geq 2 \\ & x_1, x_2 \geq 0 \end{aligned} \]

We need to re-write the problem in canonical form:

\[ \begin{aligned} \text{min} \quad & -x_1 - 3x_2 + 0x_3 + 0x_4 \\ \text{subject to} \quad & x_1 + x_2 + x_3 = 3 \\ & -3x_1 + x_2 + x_4 = 2 \\ & x_1, x_2,x_3,x_4 \geq 0 \end{aligned} \]

The initial tableau for this problem is:

\[ \begin{array}{cccc|c} x_1 & x_2 & x_3 & x_4 & -z \\ \hline -1 & -3 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 3 \\ -3 & 1 & 0 & 1 & 2 \end{array} \]

The first row is the cost row, the last column is the right-hand side, and the rest is the matrix \(A\).

We pivot on row 2, column 2:

\[ \begin{array}{cccc|c} x_1 & x_2 & x_3 & x_4 & -z \\ \hline -10 & 0 & 0 & 3 & 6 \\ 4 & 0 & 1 & -1 & 1 \\ -3 & 1 & 0 & 1 & 2 \end{array} \]

Then, we pivot on row 2, column 1:

\[ \begin{array}{cccc|c} x_1 & x_2 & x_3 & x_4 & -z \\ \hline 0 & 0 & 5/2 & 1/2 & 17/2 \\ 1 & 0 & 1/4 & -1/4 & 1/4 \\ 0 & 1 & 3/4 & 1/4 & 11/4 \end{array} \]

Here we reached a stopping criterion: the minimum cost is non-negative, so we have found an optimal solution, which is \(x^* = (\frac{1}{4}, \frac{11}{4}, 0 , 0)\) and the optimal value is \(z^* = -\frac{17}{2}\).

Building and testing

I used devtools to build and test the package:

## build

devtools::clean_dll()
cpp11::cpp_register()
devtools::document()
devtools::load_all()

## test

c <- c(-1, -3)
b <- c(3, 2)

A <- matrix(
    c(1, -3, 1, 1),
    nrow = 2,
    ncol = 2,
    byrow = FALSE
)

cpp11_simplex_phase2(c, b, A)

The result should be:

Initial tableau:
-1 -3  0  0  0 
 1  1  1  0  3 
-3  1  0  1  2 
Minimum cost: -3
Pivot row: 2
Pivot column: 2
====
New tableau:
-10  0  0  3  6 
 4  0  1 -1  1 
-3  1  0  1  2 
Minimum cost: -10
Pivot row: 1
Pivot column: 1
====
New tableau:
 0  0  2.5  0.5  8.5 
 1  0  0.25 -0.25  0.25 
 0  1  0.75  0.25  2.75 
Minimum cost: 0
Optimal solution found in 2 steps !

Using OMP (parallelization)

Originally available in my blog.

See the cpp11omp package.

Topics covered:

Motivation

One common phrase that I find when I need to Google how to do something with cpp11 is “Don’t use cpp11 because it does not offer OpenMP support.”

This is a myth. cpp11 does offer OpenMP support. In this blog post, I will show you how to use OpenMP with cpp11, but here I assume your C++ compiler already supports OpenMP.

I tested this on Windows, where you need to install Rtools, and Linux Mint (Ubuntu based) where I didn’t need anything special because the gcc compiler comes with the operating system and just works. If you are using macOS, you need to install libomp via Homebrew in order to extend the clang compiler, and this is explained here.

Creating a package

First, we need to create a package. I will use usethis to create a package called cpp11omp:

usethis::create_project("cpp11omp")

Then, I will add cpp11 as a dependency:

usethis::use_cpp11()

As the cpp11 message indicates, I created a file called R/cpp11omp-package.R with the following contents:

### usethis namespace: start
#' @useDynLib cpp11omp, .registration = TRUE
### usethis namespace: end
NULL

Cpp11 unnamed list

I added a function called squared_unnamed_ in src/code.cpp that will square each element in a vector of doubles, so the file content corresponds to the following:

#include <cpp11.hpp>
#include <omp.h>

using namespace cpp11;

[[cpp11::register]] list squared_unnamed_(doubles x) {
    // create vectors y = x^2 and z = thread number
    int n = x.size();
    writable::doubles y(n);
    writable::doubles z(n);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        y[i] = x[i] * x[i];
        z[i] = omp_get_thread_num();
    }

    //create a list containing y and z
    writable::list out;
    out.push_back(y);
    out.push_back(z);
    return out;
}

The previous function returns an unnamed list with two elements: the squared vector and the thread number. The function is registered with [[cpp11::register]] so that it can be called from R.

If I try to run square_unnamed_(1:10), it will return an error because I am passing a vector of integers instead of doubles. C++ is strict with types, so I need to create a wrapper function that will convert the integers to doubles, and it will go inside R/cpp11omp-package.R:

#' Unnamed list with squared numbers and the threads used
#' @param x A vector of doubles
#' @export
squared_unnamed <- function(x) {
  squared_unnamed_(as.double(x))
}

The previous function is exported with @export so that it can be called by the end user. The function squared_unnamed_ is an internal function. This approach also has the advantage that I can document the function in a flexible way.

Cpp11 named list

I added a function called squared_named_ in src/code.cpp that does the same but returns a named list. The additional content corresponds to the following:

[[cpp11::register]] list squared_named_(doubles x) {
    // create vectors y = x^2 and z = thread number
    int n = x.size();
    writable::doubles y(n);
    writable::doubles z(n);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        y[i] = x[i] * x[i];
        z[i] = omp_get_thread_num();
    }

    //create a list containing y and z
    writable::list out;
    out.push_back({"x^2"_nm = y});
    out.push_back({"thread"_nm = z});
    return out;
}

As in the previous part, I added a wrapper and documentation:

#' Named list with squared numbers and the threads used
#' @param x A vector of doubles
#' @export
squared_named <- function(x) {
  squared_named_(as.double(x))
}

Makevars

In order to make the #pragma instruction work, I need to add the following to src/Makevars:

PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS)
PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS)
CXX_STD = CXX11

If I don’t do this, the pragma instruction will be ignored and the functions will run in a single thread.

Building and testing

I used devtools to build and test the package:

cpp11::cpp_register()
devtools::document()
devtools::install()

Then, I tested the package from a new R session:

> library(cpp11omp)
> squared_unnamed(1:10)
[[1]]
 [1]   1   4   9  16  25  36  49  64  81 100

[[2]]
 [1] 0 0 1 1 2 3 4 5 6 7

> squared_named(1:10)
$`x^2`
 [1]   1   4   9  16  25  36  49  64  81 100

$thread
 [1] 0 0 1 1 2 3 4 5 6 7

Real-life examples

  1. Arrow: An excellent package that uses pure C++ code called from R by using the cpp11 to read and write Apache Arrow files. The package is well documented and has a lot of tests. The package is also well maintained and has a lot of contributors.

  2. Cpp11armadillo: A package that uses cpp11 to call C++ code from R to perform matrix operations using the Armadillo library. Armadillo is a very well documented C++ library with a syntax similar to MATLAB and it has been tested and improved for over ten years.

  3. Kendallknight: A pure C++ implementation of the Kendall’s correlation coefficient. The algorithm is called in R by using the cpp11 package and it provides a speedup because the complexity goes from \(O(n^2)\) to \(O(n \log(n))\) by counting inversions instead of using brute force.

  4. Redatam: A pure C++ implementation of the Redatam file format. The algorithm is called in R by using the cpp11 package and writes to an R list of data frames. Because of this cross-languages integration, it was very simple to also provide a Python package that calls C++ as the R package does.

  5. RPostgres: An amazing C++ Interface to PostgreSQL that uses cpp11 to call C++ code from R. The package is well documented and has a lot of tests. The package is also well maintained and has a lot of contributors.

References